Segment-based multiple sequence alignment

Author

  • Amarendran Ramaswami Subramanian
Abstract

In this PhD thesis, the segment-based approach to multiple sequence alignment, initially introduced by the DIALIGN program, is thoroughly investigated and substantially improved. The segment-based approach belongs to the class of local alignment methods and is therefore very strong at finding locally conserved motifs, whereas global methods align the input sequences from beginning to end without specifically looking for locally occurring conserved motifs. Local alignments, and especially segment-based methods, therefore play an important role in molecular biology research, which is underscored by the fact that the results of this PhD thesis have already been used extensively in various areas of biological research. We first present, in chapter 3, a complete re-implementation of the DIALIGN approach, DIALIGN-T, which also incorporates several improvements, such as the exclusion of low-scoring sub-fragments and weight score factors, yielding a statistically superior method on local and global benchmark databases. However, DIALIGN-T still uses a greedy, and therefore very naive, strategy to build the final alignment. In chapter 4 we therefore reformulate the assembly phase as an optimization problem that is NP-complete but that we can prove to be fixed-parameter tractable (FPT) in the number of input sequences under reasonable assumptions. Since we are interested in approaches that are useful in practice, we develop a plane-sweep algorithm that solves the assembly problem optimally and whose running time essentially grows with the number of simultaneously occurring conflicts. Building on the ideas of the plane-sweep algorithm, we extend it, in chapter 5, to a full algorithmic framework that serves as a basis for developing further optimal or near-optimal heuristics for assembling an alignment from a given set of input similarities.
Inspired by this framework, in chapter 6 we improve DIALIGN-T to its most recent version, DIALIGN-TX, which incorporates substantial improvements by combining greedy and progressive strategies for assembling the alignment. To measure the quality of our improvements, we used the standard benchmark databases BAliBASE and BRaliBASE II for global alignments and the artificially generated databases IRMBASE and DIRMBASE for local alignments. The results show that DIALIGN-TX currently outperforms all other methods on the local benchmark databases while still providing very good results on global alignments; for example, it even outperforms the very popular global alignment program CLUSTAL W on the global benchmark database BAliBASE 3. Altogether, we conclude that DIALIGN-TX is one of the strongest methods for the important class of local alignments while still providing very good results on global alignments and requiring only a reasonable amount of computation time in practice. In combination with the algorithmic framework, we obtain a rich basis for future improvements to the segment-based approach for computing general and (biological) domain-specific multiple sequence alignments.
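
The abstract's remark that a greedy assembly is naive can be illustrated with a small sketch. This is not the DIALIGN-T or DIALIGN-TX implementation: it is restricted to two sequences, and the names `Fragment`, `consistent`, and `greedy_assemble`, as well as the weights, are hypothetical choices made for this example. It assumes a fragment is a gap-free segment pair with a precomputed weight, and that two fragments are consistent when one lies entirely before the other in both sequences.

```python
from typing import List, NamedTuple


class Fragment(NamedTuple):
    """Hypothetical gap-free segment pair between two sequences.

    Positions [i, i + length) of the first sequence are aligned to
    positions [j, j + length) of the second sequence.
    """
    i: int
    j: int
    length: int
    weight: float


def consistent(f: Fragment, g: Fragment) -> bool:
    """Pairwise case only: one fragment must precede the other in both sequences."""
    f_before_g = f.i + f.length <= g.i and f.j + f.length <= g.j
    g_before_f = g.i + g.length <= f.i and g.j + g.length <= f.j
    return f_before_g or g_before_f


def greedy_assemble(fragments: List[Fragment]) -> List[Fragment]:
    """Accept fragments in order of decreasing weight if they fit what is already accepted."""
    accepted: List[Fragment] = []
    for f in sorted(fragments, key=lambda x: x.weight, reverse=True):
        if all(consistent(f, g) for g in accepted):
            accepted.append(f)
    # Return the accepted fragments in sequence order.
    return sorted(accepted, key=lambda x: x.i)


# Toy input (assumed weights): `a` conflicts with both `b` and `c`,
# while `b` and `c` are consistent with each other.
a = Fragment(i=2, j=3, length=3, weight=5.0)
b = Fragment(i=0, j=0, length=3, weight=3.0)
c = Fragment(i=4, j=4, length=3, weight=3.0)

chosen = greedy_assemble([b, a, c])
greedy_total = sum(f.weight for f in chosen)
# Greedy keeps only `a` (total weight 5.0), although `b` and `c`
# together would give a consistent set with total weight 6.0.
```

In this toy case the greedy choice blocks a better combination, which is the kind of situation an optimal assembly method is meant to resolve. With more than two sequences the consistency check is more involved than the pairwise test used here.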

Similar resources

An Application of the ABS LX Algorithm to Multiple Sequence Alignment

We present an application of ABS algorithms for multiple sequence alignment (MSA). The Markov decision process (MDP) based model leads to a linear programming problem (LPP), whose solution is linked to a suggested alignment. The important features of our work include the facility of alignment of multiple sequences simultaneously and no limit for the length of the sequences. Our goal here is to ...

Divide-and-conquer multiple alignment with segment-based constraints

A large number of methods for multiple sequence alignment are currently available. Recent benchmarking tests demonstrated that strengths and drawbacks of these methods differ substantially. Global strategies can be outperformed by approaches based on local similarities and vice versa, depending on the characteristics of the input sequences. In recent years, mixed approaches that include both gl...

Segment-based multiple sequence alignment

MOTIVATION: Many multiple sequence alignment tools have been developed in the past, progressing either in speed or alignment accuracy. Given the importance and wide-spread use of alignment tools, progress in both categories is a contribution to the community and has driven research in the field so far. RESULTS: We introduce a graph-based extension to the consistency-based, progressive alignment...

An exact solution for the Segment-to-Segment multiple sequence alignment problem

MOTIVATION: In molecular biology, sequence alignment is a crucial tool in studying the structure and function of molecules, as well as the evolution of species. In the segment-to-segment variation of the multiple alignment problem, the input can be seen as a set of non-gapped segment pairs (diagonals). Given a weight function that assigns a weight score to every possible diagonal, the goal is to...

DIALIGN2: Improvement of the segment to segment approach to multiple sequence alignment

MOTIVATION: The performance and time complexity of an improved version of the segment-to-segment approach to multiple sequence alignment is discussed. In this approach, alignments are composed from gap-free segment pairs, and the score of an alignment is defined as the sum of so-called weights of these segment pairs. RESULTS: A modification of the weight function used in the original version of...

A generalization of Profile Hidden Markov Model (PHMM) using one-by-one dependency between sequences

The Profile Hidden Markov Model (PHMM) can be poor at capturing dependency between observations because of the statistical assumptions it makes. To overcome this limitation, the dependency between residues in a multiple sequence alignment (MSA) which is the representative of a PHMM can be combined with the PHMM. Based on the fact that sequences appearing in the final MSA are written based on th...

Publication date: 2009